Write incremental results after each task completion #93
Open
juppytt wants to merge 6 commits into pinchbench:main from
Session transcripts were deleted between tasks by cleanup_agent_sessions,
making post-run debugging impossible. Now transcripts are copied to
results/{run_id}_transcripts/{task_id}.jsonl before cleanup.
Also fixes pre-existing duplicate _remove_readonly function definition
that caused a SyntaxError on import.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
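The copy step described in this commit could look roughly like the sketch below. The function name `preserve_transcript` and its parameters are hypothetical; only the destination layout `results/{run_id}_transcripts/{task_id}.jsonl` comes from the commit message.

```python
import shutil
from pathlib import Path


def preserve_transcript(session_file: Path, results_dir: Path,
                        run_id: str, task_id: str) -> Path:
    """Copy one session transcript into the results tree before cleanup.

    Hypothetical helper: the destination follows the layout named in the
    commit message, {results_dir}/{run_id}_transcripts/{task_id}.jsonl.
    """
    dest_dir = results_dir / f"{run_id}_transcripts"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{task_id}.jsonl"
    # copy2 also preserves timestamps, which can help when debugging later.
    shutil.copy2(session_file, dest)
    return dest
```

Calling a helper like this just before cleanup_agent_sessions runs would leave the original session file for cleanup to delete while keeping a copy for post-run debugging.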
When --judge is specified with a model ID, the judge calls the model API directly instead of running an OpenClaw agent session. This avoids OpenClaw personality files (SOUL.md, IDENTITY.md) overriding the judge's JSON-only grading instructions, which caused all llm_judge tasks to score 0.

Supported model prefixes:
- openrouter/* -> OpenRouter API (OPENROUTER_API_KEY)
- anthropic/* -> Anthropic Messages API (ANTHROPIC_API_KEY)
- openai/* -> OpenAI chat completions (OPENAI_API_KEY)
- claude -> headless Claude CLI (claude -p)

Without --judge, behavior is unchanged (OpenClaw agent session).

Also fixes pre-existing duplicate _remove_readonly function definition in lib_agent.py that caused an IndentationError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
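The prefix routing in this commit can be illustrated with a small dispatch table. This is a minimal sketch, not the PR's code: `JudgeBackend`, `resolve_judge_backend`, and the backend labels are hypothetical names, and raising `ValueError` for an unrecognized model ID is an assumption, since the commit message does not say how that case is handled.

```python
from typing import NamedTuple, Optional


class JudgeBackend(NamedTuple):
    name: str                   # hypothetical label for the backend
    api_key_env: Optional[str]  # environment variable holding the API key


# Prefix -> backend, following the list in the commit message.
_PREFIX_BACKENDS = {
    "openrouter/": JudgeBackend("openrouter", "OPENROUTER_API_KEY"),
    "anthropic/": JudgeBackend("anthropic", "ANTHROPIC_API_KEY"),
    "openai/": JudgeBackend("openai", "OPENAI_API_KEY"),
}


def resolve_judge_backend(judge: Optional[str]) -> JudgeBackend:
    """Map a --judge value to the backend that should grade the task."""
    if not judge:
        # Without --judge, behavior is unchanged: an OpenClaw agent session.
        return JudgeBackend("openclaw_session", None)
    if judge == "claude":
        # Headless Claude CLI (claude -p); no API key variable listed.
        return JudgeBackend("claude_cli", None)
    for prefix, backend in _PREFIX_BACKENDS.items():
        if judge.startswith(prefix):
            return backend
    # Assumption: unknown model IDs are rejected rather than silently ignored.
    raise ValueError(f"Unsupported judge model: {judge!r}")
```

Keeping the mapping in a table means a new provider prefix needs only one added entry rather than another branch.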
The function was defined twice on consecutive lines with the second definition shadowing the first. Also removed an extra bare func(path) call outside the try/except block.
Update the result JSON after every task finishes grading so external tools can poll progress while the benchmark is still running. The partial result includes in_progress=true, completed_tasks, and total_tasks fields. The final write at the end overwrites without these fields.
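A minimal sketch of the per-task write, assuming a hypothetical `write_partial_result` helper and a hypothetical `tasks` key for the graded results; only the in_progress, completed_tasks, and total_tasks fields come from the commit message. Writing to a temporary file and renaming it is an addition of this sketch, not something the PR states, so that a polling tool never reads a half-written file.

```python
import json
import os
import tempfile
from pathlib import Path


def write_partial_result(path: Path, task_results: list, total_tasks: int) -> None:
    """Rewrite the result JSON after a task finishes grading."""
    payload = {
        "tasks": task_results,  # hypothetical key for the graded tasks so far
        "in_progress": True,
        "completed_tasks": len(task_results),
        "total_tasks": total_tasks,
    }
    # Write to a temp file in the same directory, then swap it into place.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)
    os.replace(tmp_name, path)
```

The final write at the end of the run would then produce the same file without the three progress fields, which is how a reader can tell the run has finished.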
Summary
Partial results include in_progress: true, completed_tasks, and total_tasks fields.

This enables dashboards and monitoring tools to display per-task progress while a benchmark run is still in progress.
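On the consuming side, a dashboard could poll the result file along the lines of the sketch below. The `read_progress` helper and its return shape are hypothetical; the sketch assumes only that partial results carry in_progress, completed_tasks, and total_tasks, and that the final result omits them.

```python
import json
from pathlib import Path
from typing import Optional


def read_progress(result_path: Path) -> Optional[dict]:
    """Return a small progress summary, or None if nothing is readable yet."""
    try:
        data = json.loads(result_path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        # The run has not written anything yet, or a write is in flight.
        return None
    if not data.get("in_progress", False):
        # The final write drops the progress fields, so the run is complete.
        return {"done": True}
    return {
        "done": False,
        "completed": data["completed_tasks"],
        "total": data["total_tasks"],
    }
```

A monitoring loop can call this on an interval and stop once the returned summary reports done.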